http://genomicsclass.github.io/book/pages/rnaseq_gene_level.html http://www.bioconductor.org/help/workflows/rnaseqGene
RNA-seq is a technology that uses next-generation/high-throughput sequencing (NGS/HTS) to reveal a snapshot of the kind and amount of RNA molecules in a biological sample at a given moment in time.
Transcript quantification based on RNA-seq counts is unstable due to ambiguity and bias (not-uniform coverage along the length of the transcript due to technical bias).
Here, we will focus on comparing the expression levels of genes across different samples, by counting the number of reads which overlap the exons of genes defined by a known annotation of the reference genome.
The FPKM (fragments per kilobase of sequence per million mapped reads) is a normalized measure of expression level (having divided out the transcript length and the number of mapped reads).
TPM (transcripts per million), is a linear scaling of the FPKM, such that we would expect a gene with 1 TPM to have one molecule in a population of one million mRNA.
TPM is equal to: (FPKM / sum(FPKM)) * 1 million
CPM (counts per million)
Adapter trimming: Adapters that are used in the sequencing can sometimes show up in the read sequence. Because these adapter sequences are non-genomic, they can cause problems with read alignment. A number of software packages can be used to find and trim adapters in FASTQ files, e.g., Scythe, FASTX-Toolkit and cutadapt.
Quality trimming: The 3’ end of reads often has lower quality, and software like FASTX-Toolkit or cutadapt offers to trim away sequence which would likely interfere with read alignment (given that there is enough remaining 5’ sequence for unique alignments). The trade-off between quality trimming vs other methods of error correction is likely specific to the characteristics on individual datasets.
Length adjustment: When the length of paired-end fragments is less than 2 * read length, there can be a benefit from combining overlapping pairs into a single long read using software like FLASH.
Aligners generate SAM/BAM files containing the reads from each biological sample aligned to a reference genome from:
See also: Illumina iGenomes
A count matrix is a summarized version of an RNA-seq experiment which has genes along the rows and biological samples along the columns. The values in the matrix are the number of RNA-seq reads which could be uniquely aligned to the exons of a given gene for a given sample.
First, we define variables pointing to 1) a sample table (containing the metadata for each sample), 2) the BAM files (containing the reads from each biological sample aligned to a reference genome), and 3) a GTF file (containing the annotation for the reference genome that was used to generate the BAM files). Use the sample table to contruct the BAM files vector, so that the count matrix will be in the same order as the sample table.
Himes2014 describes the data from the airway package that are used in the following examples.
library(airway)
dir <- system.file("extdata", package = "airway", mustWork = TRUE)
csv_file <- file.path(dir, "sample_table.csv")
sample_table <- read.csv(csv_file, row.names = 1)
print(sample_table)
## SampleName cell dex albut Run avgLength Experiment
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 SRX384357
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 SRX384358
## Sample BioSample
## SRR1039508 SRS508568 SAMN02422669
## SRR1039509 SRS508567 SAMN02422675
## SRR1039512 SRS508571 SAMN02422678
## SRR1039513 SRS508572 SAMN02422670
## SRR1039516 SRS508575 SAMN02422682
## SRR1039517 SRS508576 SAMN02422673
## SRR1039520 SRS508579 SAMN02422683
## SRR1039521 SRS508580 SAMN02422677
bam_files <- file.path(dir, paste0(sample_table$Run, "_subset.bam"))
gtf_file <- file.path(dir, "Homo_sapiens.GRCh37.75_subset.gtf")
Next, we make a BamFileList object bam_file_list which wraps the BAM files. We can ignore the warning about matchCircularity.
library(Rsamtools)
## Loading required package: XVector
## Loading required package: Biostrings
bam_files_list <- BamFileList(bam_files) # see also yieldSize option if working in an environment with a limited amount of memory
Finally, we make a GRangesList object exons_by_gene which contains the exons for each gene.
library(GenomicFeatures)
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
##
## Attaching package: 'AnnotationDbi'
##
## The following object is masked from 'package:dplyr':
##
## select
txdb <- makeTxDbFromGFF(gtf_file, format = "gtf")
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
exons_by_gene <- exonsBy(txdb, by = "gene")
Next, we create a SummarizedExperiment object se which contains the counts for the reads in each BAM file in bam_files_list (columns) aligned to each gene in exons_by_gene (rows). We add the sample_table dataframe as column data. The column order is correct, because the bam_files_list was constructed from the sample_table.
library(GenomicAlignments)
##
## Attaching package: 'GenomicAlignments'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
se <- summarizeOverlaps(exons_by_gene, bam_files_list,
mode = "Union",
singleEnd = FALSE,
fragments = TRUE,
ignore.strand = TRUE)
head(colData(se))
## DataFrame with 6 rows and 0 columns
colData(se) <- DataFrame(sample_table)
head(colData(se))
## DataFrame with 6 rows and 9 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
## Experiment Sample BioSample
## <factor> <factor> <factor>
## SRR1039508 SRX384345 SRS508568 SAMN02422669
## SRR1039509 SRX384346 SRS508567 SAMN02422675
## SRR1039512 SRX384349 SRS508571 SAMN02422678
## SRR1039513 SRX384350 SRS508572 SAMN02422670
## SRR1039516 SRX384353 SRS508575 SAMN02422682
## SRR1039517 SRX384354 SRS508576 SAMN02422673
rowRanges(se) # same as rowData()
## GRangesList object of length 20:
## $ENSG00000009724
## GRanges object with 18 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] 1 [11086580, 11087705] - | 98 ENSE00000818830
## [2] 1 [11090233, 11090307] - | 99 ENSE00000472123
## [3] 1 [11090805, 11090939] - | 100 ENSE00000743084
## [4] 1 [11094885, 11094963] - | 101 ENSE00000743085
## [5] 1 [11097750, 11097868] - | 103 ENSE00003520086
## ... ... ... ... ... ... ...
## [14] 1 [11106948, 11107176] - | 111 ENSE00003467404
## [15] 1 [11106948, 11107176] - | 112 ENSE00003489217
## [16] 1 [11107260, 11107280] - | 113 ENSE00001833377
## [17] 1 [11107260, 11107284] - | 114 ENSE00001472289
## [18] 1 [11107260, 11107290] - | 115 ENSE00001881401
##
## ...
## <19 more elements>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
head(assay(se)) # assay(se) is the count matrix
| SRR1039508 | SRR1039509 | SRR1039512 | SRR1039513 | SRR1039516 | SRR1039517 | SRR1039520 | SRR1039521 | |
|---|---|---|---|---|---|---|---|---|
| ENSG00000009724 | 38 | 28 | 66 | 24 | 42 | 41 | 47 | 36 |
| ENSG00000116649 | 1004 | 1255 | 1122 | 1313 | 1100 | 1879 | 745 | 1536 |
| ENSG00000120942 | 218 | 256 | 233 | 252 | 269 | 465 | 207 | 400 |
| ENSG00000120948 | 2751 | 2080 | 3353 | 1614 | 3519 | 3716 | 2220 | 1990 |
| ENSG00000171819 | 4 | 50 | 19 | 543 | 1 | 10 | 14 | 1067 |
| ENSG00000171824 | 869 | 1075 | 1115 | 1051 | 944 | 1405 | 748 | 1590 |
We can use the featureCounts function in the Rsubread package as an alternative to the summarizeOverlaps function in the GenomicAlignments package:
library(Rsubread)
fc <- featureCounts(bam_files, annot.ext = gtf_file, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE)
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 1.18.0
##
## //========================== featureCounts setting ===========================\\
## || ||
## || Input files : 8 BAM files ||
## || P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## || P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## || P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## || P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## || P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## || P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## || P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## || P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## || ||
## || Output file : ./.Rsubread_featureCounts_pid10470 ||
## || Annotations : /Users/marcoe02/Library/R/3.2/library/airway ... ||
## || ||
## || Threads : 1 ||
## || Level : meta-feature level ||
## || Paired-end : yes ||
## || Strand specific : no ||
## || Multimapping reads : not counted ||
## || Multi-overlapping reads : not counted ||
## || ||
## || Chimeric reads : counted ||
## || Both ends mapped : not required ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
##
## //================================= Running ==================================\\
## || ||
## || Load annotation file /Users/marcoe02/Library/R/3.2/library/airway/extd ... ||
## || Features : 406 ||
## || Meta-features : 20 ||
## || Chromosomes : 1 ||
## || ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## || Paired-end reads are included. ||
## || Assign fragments (read pairs) to features... ||
## || Found reads that are not properly paired. ||
## || (missing mate or the mate is not the next read) ||
## || 2 reads have missing mates. ||
## || Input was converted to a format accepted by featureCounts. ||
## || Total fragments : 7142 ||
## || Successfully assigned fragments : 6649 (93.1%) ||
## || Running time : 0.14 minutes ||
## || ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## || Paired-end reads are included. ||
## || Assign fragments (read pairs) to features... ||
## || Found reads that are not properly paired. ||
## || (missing mate or the mate is not the next read) ||
## || 1 read has missing mates. ||
## || Input was converted to a format accepted by featureCounts. ||
## || Total fragments : 7200 ||
## || Successfully assigned fragments : 6712 (93.2%) ||
## || Running time : 0.13 minutes ||
## || ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## || Paired-end reads are included. ||
## || Assign fragments (read pairs) to features... ||
## || Found reads that are not properly paired. ||
## || (missing mate or the mate is not the next read) ||
## || 2 reads have missing mates. ||
## || Input was converted to a format accepted by featureCounts. ||
## || Total fragments : 8536 ||
## || Successfully assigned fragments : 7910 (92.7%) ||
## || Running time : 0.15 minutes ||
## || ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## || Paired-end reads are included. ||
## || Assign fragments (read pairs) to features... ||
## || Found reads that are not properly paired. ||
## || (missing mate or the mate is not the next read) ||
## || 1 read has missing mates. ||
## || Input was converted to a format accepted by featureCounts. ||
## || Total fragments : 7544 ||
## || Successfully assigned fragments : 7044 (93.4%) ||
## || Running time : 0.14 minutes ||
## || ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## || Paired-end reads are included. ||
## || Assign fragments (read pairs) to features... ||
## || Found reads that are not properly paired. ||
## || (missing mate or the mate is not the next read) ||
## || 2 reads have missing mates. ||
## || Input was converted to a format accepted by featureCounts. ||
## || Total fragments : 8818 ||
## || Successfully assigned fragments : 8261 (93.7%) ||
## || Running time : 0.15 minutes ||
## || ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## || Paired-end reads are included. ||
## || Assign fragments (read pairs) to features... ||
## || Found reads that are not properly paired. ||
## || (missing mate or the mate is not the next read) ||
## || 6 reads have missing mates. ||
## || Input was converted to a format accepted by featureCounts. ||
## || Total fragments : 11850 ||
## || Successfully assigned fragments : 11148 (94.1%) ||
## || Running time : 0.15 minutes ||
## || ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## || Paired-end reads are included. ||
## || Assign fragments (read pairs) to features... ||
## || Found reads that are not properly paired. ||
## || (missing mate or the mate is not the next read) ||
## || 6 reads have missing mates. ||
## || Input was converted to a format accepted by featureCounts. ||
## || Total fragments : 5877 ||
## || Successfully assigned fragments : 5415 (92.1%) ||
## || Running time : 0.13 minutes ||
## || ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## || Paired-end reads are included. ||
## || Assign fragments (read pairs) to features... ||
## || Found reads that are not properly paired. ||
## || (missing mate or the mate is not the next read) ||
## || 3 reads have missing mates. ||
## || Input was converted to a format accepted by featureCounts. ||
## || Total fragments : 10208 ||
## || Successfully assigned fragments : 9538 (93.4%) ||
## || Running time : 0.15 minutes ||
## || ||
## || Read assignment finished. ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
names(fc)
## [1] "counts" "annotation" "targets" "stat"
head(fc$counts) # fc$counts is the count matrix
| X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039508_subset.bam | X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039509_subset.bam | X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039512_subset.bam | X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039513_subset.bam | X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039516_subset.bam | X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039517_subset.bam | X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039520_subset.bam | X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039521_subset.bam | |
|---|---|---|---|---|---|---|---|---|
| ENSG00000175262 | 0 | 0 | 4 | 1 | 0 | 0 | 1 | 0 |
| ENSG00000215785 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000264181 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000207213 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000120948 | 2673 | 2031 | 3263 | 1570 | 3446 | 3615 | 2171 | 1949 |
| ENSG00000009724 | 38 | 28 | 66 | 24 | 42 | 41 | 47 | 36 |
We now load the full SummarizedExperiment object, counting reads over all the genes.
library(airway)
data(airway)
print(airway)
## class: SummarizedExperiment
## dim: 64102 8
## exptData(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowRanges metadata column names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
colData(airway)
## DataFrame with 8 rows and 9 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
## Experiment Sample BioSample
## <factor> <factor> <factor>
## SRR1039508 SRX384345 SRS508568 SAMN02422669
## SRR1039509 SRX384346 SRS508567 SAMN02422675
## SRR1039512 SRX384349 SRS508571 SAMN02422678
## SRR1039513 SRX384350 SRS508572 SAMN02422670
## SRR1039516 SRX384353 SRS508575 SAMN02422682
## SRR1039517 SRX384354 SRS508576 SAMN02422673
## SRR1039520 SRX384357 SRS508579 SAMN02422683
## SRR1039521 SRX384358 SRS508580 SAMN02422677
rowRanges(airway)
## GRangesList object of length 64102:
## $ENSG00000000003
## GRanges object with 17 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] X [99883667, 99884983] - | 667145 ENSE00001459322
## [2] X [99885756, 99885863] - | 667146 ENSE00000868868
## [3] X [99887482, 99887565] - | 667147 ENSE00000401072
## [4] X [99887538, 99887565] - | 667148 ENSE00001849132
## [5] X [99888402, 99888536] - | 667149 ENSE00003554016
## ... ... ... ... ... ... ...
## [13] X [99890555, 99890743] - | 667156 ENSE00003512331
## [14] X [99891188, 99891686] - | 667158 ENSE00001886883
## [15] X [99891605, 99891803] - | 667159 ENSE00001855382
## [16] X [99891790, 99892101] - | 667160 ENSE00001863395
## [17] X [99894942, 99894988] - | 667161 ENSE00001828996
##
## ...
## <64101 more elements>
## -------
## seqinfo: 722 sequences (1 circular) from an unspecified genome
head(assay(airway))
| SRR1039508 | SRR1039509 | SRR1039512 | SRR1039513 | SRR1039516 | SRR1039517 | SRR1039520 | SRR1039521 | |
|---|---|---|---|---|---|---|---|---|
| ENSG00000000003 | 679 | 448 | 873 | 408 | 1138 | 1047 | 770 | 572 |
| ENSG00000000005 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000000419 | 467 | 515 | 621 | 365 | 587 | 799 | 417 | 508 |
| ENSG00000000457 | 260 | 211 | 263 | 164 | 245 | 331 | 233 | 229 |
| ENSG00000000460 | 60 | 55 | 40 | 35 | 78 | 63 | 76 | 60 |
| ENSG00000000938 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 |
First, note that different samples have been sequenced at different depths and we need to normalize for this in order to more easily compare samples.
sort(round(colSums(assay(airway))/1e6, 2))
## SRR1039513 SRR1039509 SRR1039520 SRR1039508 SRR1039521 SRR1039516
## 15.16 18.81 19.13 20.64 21.16 24.45
## SRR1039512 SRR1039517
## 25.35 30.82
In addition, with un-transformed gene counts, the high count genes have high variance. That is, in the following scatter plot, the points start out in a tight cone and then fan out toward the top right. This is a general property of counts generated from sampling processes, that the variance typically increases with the expected value. If these genes were included in a distance calculation (e.g., for sample clustering), the high variance at the high count range could overwhelm the signal at the lower count range. We will explore different transformations below to correct for this effect (e.g., the log2 and rlog transformations).
plot(assay(airway)[, 1:2], cex = .1, main = "untransf, not norm")
We use the DESeq2 package to normalize for sequencing depth. The DESeqDataSet object is an extension of the SummarizedExperiment object, with a few changes. The matrix in assay is now accessed with counts and the elements of this matrix are required to be non-negative integers (0,1,2,…).
We specify an experimental design here, for later use, although for estimating size factors, we could just use ~ 1 as a default design. The variables are columns of the colData, and the + indicates that for differential expression analysis we want to compare levels of dex while controlling for the cell differences.
library(DESeq2)
## Loading required package: Rcpp
## Loading required package: RcppArmadillo
dds <- DESeqDataSet(airway, design = ~ cell + dex)
A DESeqDataSet object can also be constructed using a count matrix and column/sample data!
dds2 <- DESeqDataSetFromMatrix(fc$counts, colData = sample_table, design = ~ cell + dex)
See also: DESeqDataSetFromHTSeqCount()
Please not that the last variable in the design (in this case the dex variable/treatment) is used by default for building the results table!
Estimate the size factors to account for differences in sequencing depth. The size factors are similar to colum sums, but are more robust.
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## 1.0236476 0.8961667 1.1794861 0.6700538 1.1776714 1.3990365
## SRR1039520 SRR1039521
## 0.9207787 0.9445141
colSums(counts(dds))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## 20637971 18809481 25348649 15163415 24448408 30818215
## SRR1039520 SRR1039521
## 19126151 21164133
plot(sizeFactors(dds), colSums(counts(dds)), cex = .1)
abline(lm(colSums(counts(dds)) ~ sizeFactors(dds) + 0))
Size factors are calculated as the median ratio of each sample to a pseudo-sample (i.e., the geometric mean of all samples). In other words, for each sample, we take the exponent of the median of the log ratios (of each sample to a pseudo-sample). For example, the size factor for the first sample:
n.b.: log(a/b) = log(a) - log(b)
loggeomeans <- rowMeans(log(counts(dds)))
exp(median((log(counts(dds)[, 1]) - loggeomeans)[is.finite(loggeomeans)]))
## [1] 1.023648
sizeFactors(dds)[1]
## SRR1039508
## 1.023648
Examine the log2-transformed counts and the log2-transformed and depth-normalized counts (plus a pseudocount), excluding genes with zero counts across all samples.
rs = rowSums(counts(dds))
mypar(1,2)
boxplot(log2(counts(dds)[rs > 0, ] + 1), main = "log2-trans, not norm")
boxplot(log2(counts(dds, normalized = TRUE)[rs > 0, ] + 1), main = "log2-trans, depth-norm")
Make a matrix of log2 normalized counts (plus a pseudocount):
log2.norm.counts = log2(counts(dds, normalized = TRUE) + 1)
Make a scatterplot of the log2 normalized counts (plus a pseudocount) against each other. Note the fanning out of the points in the lower left corner (the opposite of the effect observed when using the untransformed counts).
plot(log2.norm.counts[, 1:2], cex = .1, main = "log2-trans, depth-norm")
Now we will use a more sophisticated transformation (regularized logarithm or rlog), which uses the variance model for count data to shrink together the log-transformed counts for genes with very low counts. For genes with medium and high counts, the rlog is very close to log2. Another transformation for stabilizing variance in the DESeq2 package is the appropriately named varianceStabilizingTransformation. These two tranformations are similar, although the rlog might perform better when the size factors vary widely.
rlog.dds <- rlog(dds)
rlog.norm.counts <- assay(rlog.dds)
plot(rlog.norm.counts[, 1:2], cex = .1, main = "rlog-trans, depth-norm")
We can examine the standard deviation of rows over the mean for the log2-transformed and depth-normalized counts (plus pseudocount) and the rlog transformation.
library(vsn)
mypar(1, 2)
meanSdPlot(log2.norm.counts, ranks = FALSE, ylim = c(0, 3), main = "log2")
meanSdPlot(rlog.norm.counts, ranks = FALSE, ylim = c(0, 3), main = "rlog")
Note that the genes with high variance for the log2 normalized counts (plus a pseudocount) transformation come from the genes with lowest mean. If these genes were included in a distance calculation (e.g., for sample clustering), the high variance at the low count range could overwhelm the signal at the higher count range.
With normalized and variance-stabilized gene counts across samples, we can use principal components analysis (PCA) as a useful diagnostics for examining relationships between samples.
Using the log2 normalized counts (plus a pseudocount) transformation:
# To speed up the PCA, include only the 500 genes with the most variance across samples
rowvar <- apply(log2.norm.counts, 1, var)
topgenes_idx <- head(order(rowvar, decreasing = TRUE), 500)
pc <- prcomp(t(log2.norm.counts[topgenes_idx, ])) # transpose because prcomp expects samples to be in rows
mypar()
plot(pc$x[,1], pc$x[,2], col = colData(dds)$dex, pch = as.integer(colData(dds)$cell))
Using the rlog transformation (and the plotPCA function in DESeq2 instead of the standard plot function):
plotPCA(rlog.dds, intgroup = "dex")
plotPCA(rlog.dds, intgroup = c("dex", "cell"))
We can make this plot even nicer using the ggplot2 library to customize the output of the plotPCA function.
library(ggplot2)
(data <- plotPCA(rlog.dds, intgroup = c("dex", "cell"), returnData = TRUE))
| PC1 | PC2 | group | dex | cell | name | |
|---|---|---|---|---|---|---|
| SRR1039508 | -14.331359 | -4.208796 | untrt : N61311 | untrt | N61311 | SRR1039508 |
| SRR1039509 | 6.754169 | -2.245244 | trt : N61311 | trt | N61311 | SRR1039509 |
| SRR1039512 | -8.130393 | -3.952904 | untrt : N052611 | untrt | N052611 | SRR1039512 |
| SRR1039513 | 14.505648 | -2.941863 | trt : N052611 | trt | N052611 | SRR1039513 |
| SRR1039516 | -11.891409 | 13.735002 | untrt : N080611 | untrt | N080611 | SRR1039516 |
| SRR1039517 | 8.373975 | 17.823844 | trt : N080611 | trt | N080611 | SRR1039517 |
| SRR1039520 | -9.965898 | -10.014674 | untrt : N061011 | untrt | N061011 | SRR1039520 |
| SRR1039521 | 14.685269 | -8.195366 | trt : N061011 | trt | N061011 | SRR1039521 |
(percentVar <- 100 * round(attr(data, "percentVar"), 2))
## [1] 39 27
makeLab <- function(x, pc) paste0("PC", pc, ": ", x, "% variance")
ggplot(data, aes(PC1, PC2, col = dex, shape = cell)) + geom_point() + xlab(makeLab(percentVar[1], 1)) + ylab(makeLab(percentVar[2], 2)) + theme_bw()
In addition, we can plot a hierarchical clustering tree based on a Euclidean distance matrix between samples.
mypar(1,2)
plot(hclust(dist(t(log2.norm.counts))), labels = colData(dds)$dex, main = "log2")
plot(hclust(dist(t(rlog.norm.counts))), labels = colData(rlog.dds)$dex, main = "rlog")
In the absence of biological and technical variability, RNA-seq counts behave much like Poisson-distributed random variables, with the variability coming only from random sampling.
Often, we are interested in the log ratios of counts (one ratio for each gene/exon/transcript/feature of interest) to compare two samples or conditions. Beware that the log ratios of counts of two Poisson-distributed random variables have higher variance at low lambdas (i.e., transcripts with low abundance).
In reality (i.e., in the presence of biological and technical variability) the behavior of RNA-seq counts can be better approximated by negative binomial (NB)-distributed random variables (for which the mean exceeds the variance), with the variability coming from random sampling and additional sources causing over-dispersion with respect to the Poisson distribution (for which the mean equals the variance).
mypar(3,1)
n <- 10000
brks <- 0:400
hist(rpois(n, lambda = 100), main = "Poisson/NB, disp = 0", xlab = "", breaks = brks, col = "black")
hist(rnbinom(n, mu = 100, size = 1/.01), main = "NB, disp = 0.01", xlab = "", breaks = brks, col = "black")
hist(rnbinom(n, mu = 100, size = 1/.1), main = "NB, disp = 0.1", xlab = "", breaks = brks, col = "black")
The sqrt of the dispersion coefficient in the NB distribution is the coefficient of variation (i.e., stdev/mu) after subtracting the variance due to random sampling (i.e., the coefficient of variation of technical+biological variability), which is a measure of over-dispersion with respect to the Poisson distribution.
Raw RNA-seq counts:
\[K_{ij} \sim \text{NB}(\mu_{ij} = s_{ij} q_{ij} )\]
Normalized RNA-seq counts:
\[\frac{K_{ij}}{s_{ij}} \sim \mathcal{L}(\mu_{ij} = q_{ij})\]
When normalizing, i.e., scaling up or down, Poisson-distributed random variables/the raw RNA-seq counts, we break the implicit link between the mean and the variance.
mypar(3,1)
n <- 10000
brks <- 0:400
hist(rpois(n,100), main = "", xlab = "", breaks=brks, col = "black")
hist(rpois(n,1000)/10, main = "", xlab = "", breaks=brks, col = "black")
hist(rpois(n,10)*10, main = "", xlab = "", breaks=brks, col = "black")
This becomes a problem when we have a small number of biological replicates and thus we cannot robustly estimate the within-group variance. Thus, we can get a benefit of information from using the raw RNA-seq counts, and incorporating normalization factors on the right side of the equation above instead.
We are interested in the effect of the dex treatment expressed as log2 ratio of treated over untreated/control and thus we need to make sure that the untreated/control level of the dex variable/treatment is the first level.
levels(dds$dex)
## [1] "trt" "untrt"
dds$dex <- relevel(dds$dex, "untrt")
levels(dds$dex)
## [1] "untrt" "trt"
We then run the DESeq2 model and inspect the results table (built according to the last variable in the design, in this case the dex variable/treatment).
dds <- DESeq(dds)
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)
head(res)
## log2 fold change (MAP): dex trt vs untrt
## Wald test p-value: dex trt vs untrt
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat
## <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 708.6021697 -0.37424998 0.09873107 -3.7906000
## ENSG00000000005 0.0000000 NA NA NA
## ENSG00000000419 520.2979006 0.20215551 0.10929899 1.8495642
## ENSG00000000457 237.1630368 0.03624826 0.13684258 0.2648902
## ENSG00000000460 57.9326331 -0.08523371 0.24654402 -0.3457140
## ENSG00000000938 0.3180984 -0.11555969 0.14630532 -0.7898530
## pvalue padj
## <numeric> <numeric>
## ENSG00000000003 0.0001502838 0.001217611
## ENSG00000000005 NA NA
## ENSG00000000419 0.0643763851 0.188306353
## ENSG00000000457 0.7910940556 0.907203245
## ENSG00000000460 0.7295576905 0.874422374
## ENSG00000000938 0.4296136373 NA
# padj is calculated using BH -> FDR
table(res$padj < 0.1)
| FALSE | TRUE |
|---|---|
| 12644 | 4897 |
summary(res)
##
## out of 33469 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 2646, 7.9%
## LFC < 0 (down) : 2251, 6.7%
## outliers [1] : 0, 0%
## low counts [2] : 15928, 48%
## (mean count < 5.3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Please note that the default behavior is to calculate the adjusted pval using BH and to filter the results using a FDR of 10%. However, it is possible to change the FDR by specifying the alpha' argument of theresults` function.
We can visualize the results using an MA-plot, where the y-axis M is the log2 fold change, the x-axis A is the mean raw count across all samples.
mypar()
plotMA(res, ylim = c(-4, 4))
Please note that the default statistical test looks for log2 fold changes that are different than zero (i.e., the null hypothesis of no change in gene expression), which are marked in red. However, it is possible to only look for log2 fold changes that are higher than a certain threshold by specifying the lfcThreshold' argument of theresults` function.
res2 <- results(dds, lfcThreshold = 1)
plotMA(res2, ylim = c(-4, 4))
DESeq2 automatically filters out low count genes, and chooses a threshold by optimizing the number of genes with adjusted p-value < alpha. Adjusted p-values are larger than the original p-value. The amount that the adjusted p-value is raised depends on the number of tests. If we can remove genes which have no power to find differences (genes with very small counts, where the differences are obscured by noise), we can increase power because we run fewer tests and thus have smaller adjusted p-values. A review of this filtering strategy is presented in Bourgon2010.
DESeq2 also automatically filters out (or with more samples, replaces) genes with outlier counts, to leave only genes with consistent differences across condition. Both DESeq2 filters can be customized or turned off by the arguments listed in the summary function output.
We mentioned that gene counts are proportional to gene expression, as well as sequencing depth, average transcript length and other technical bias factors. In this analysis we accounted for sequencing depth changes across samples, and we assumed that the other factors cancelled out when calculating fold changes across samples. However, it is possible to account for these differences across samples as well, by using other software to estimate sample-specific bias parameters (e.g., the CQN or EDASeq package) and provide this information to the normalizationFactors function before running the DESeq function. We could also use software like RSEM to estimate the average transcript length for each gene and sample and provide this information to the estimateSizeFactors function.
Next, we sort the results by adjusted pval in order to inspect the top DEGs…
res.sorted_by_padj <- res[order(res$padj), ]
head(res.sorted_by_padj)
## log2 fold change (MAP): dex trt vs untrt
## Wald test p-value: dex trt vs untrt
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat
## <numeric> <numeric> <numeric> <numeric>
## ENSG00000152583 997.4398 4.316100 0.1724127 25.03354
## ENSG00000165995 495.0929 3.188698 0.1277441 24.96160
## ENSG00000101347 12703.3871 3.618232 0.1499441 24.13054
## ENSG00000120129 3409.0294 2.871326 0.1190334 24.12201
## ENSG00000189221 2341.7673 3.230629 0.1373644 23.51868
## ENSG00000211445 12285.6151 3.552999 0.1589971 22.34631
## pvalue padj
## <numeric> <numeric>
## ENSG00000152583 2.637881e-138 4.627108e-134
## ENSG00000165995 1.597973e-137 1.401503e-133
## ENSG00000101347 1.195378e-128 6.441180e-125
## ENSG00000120129 1.468829e-128 6.441180e-125
## ENSG00000189221 2.627083e-122 9.216332e-119
## ENSG00000211445 1.311440e-110 3.833996e-107
…and plot the counts for the top DEG.
plotCounts(dds, gene = which.min(res$padj), intgroup = "dex")
data <- plotCounts(dds, gene = which.min(res$padj), intgroup = c("dex", "cell"), returnData = TRUE)
head(data)
| count | dex | cell | |
|---|---|---|---|
| SRR1039508 | 61.06772 | untrt | N61311 |
| SRR1039509 | 2276.86214 | trt | N61311 |
| SRR1039512 | 84.43486 | untrt | N052611 |
| SRR1039513 | 1749.61320 | trt | N052611 |
| SRR1039516 | 85.41333 | untrt | N080611 |
| SRR1039517 | 1375.73220 | trt | N080611 |
library(ggplot2)
ggplot(data, aes(x = dex, y = count, col = cell)) +
geom_point(position = position_jitter(width = .1, height = 0)) +
scale_y_log10() +
theme_bw()
ggplot(data, aes(x = dex, y = count, col = cell, group = cell)) +
geom_point() +
geom_line() +
scale_y_log10() +
theme_bw()
To display the counts for more than one gene we can use a heatmap.
library(pheatmap)
topgenes <- head(rownames(res.sorted_by_padj), 20)
print(topgenes)
## [1] "ENSG00000152583" "ENSG00000165995" "ENSG00000101347"
## [4] "ENSG00000120129" "ENSG00000189221" "ENSG00000211445"
## [7] "ENSG00000157214" "ENSG00000162614" "ENSG00000125148"
## [10] "ENSG00000154734" "ENSG00000139132" "ENSG00000162493"
## [13] "ENSG00000162692" "ENSG00000179094" "ENSG00000134243"
## [16] "ENSG00000163884" "ENSG00000178695" "ENSG00000146250"
## [19] "ENSG00000198624" "ENSG00000148848"
# use rlog-transformed and depth-normalized counts instead of the raw counts for this kind of viz
mat <- rlog.norm.counts[topgenes, ]
# subtract the mean to generate a more uniform viz that is not driven by mean level of gene expression but rather displays how gene expression varies about the mean level (irrespective of what that is)
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(dds)[, c("dex", "cell")])
pheatmap(mat, annotation_col = df)
We can also add annotations to the results.
library(org.Hs.eg.db)
## Loading required package: DBI
keytypes(org.Hs.eg.db)
## [1] "ENTREZID" "PFAM" "IPI" "PROSITE"
## [5] "ACCNUM" "ALIAS" "ENZYME" "MAP"
## [9] "PATH" "PMID" "REFSEQ" "SYMBOL"
## [13] "UNIGENE" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [17] "GENENAME" "UNIPROT" "GO" "EVIDENCE"
## [21] "ONTOLOGY" "GOALL" "EVIDENCEALL" "ONTOLOGYALL"
## [25] "OMIM" "UCSCKG"
head(rownames(dds))
## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
## [5] "ENSG00000000460" "ENSG00000000938"
anno <- select(org.Hs.eg.db, keys = topgenes, columns = c("SYMBOL", "GENENAME"), keytype = "ENSEMBL")
anno[match(topgenes, anno$ENSEMBL), ] # see also the mapIds function
| ENSEMBL | SYMBOL | GENENAME | |
|---|---|---|---|
| 1 | ENSG00000152583 | SPARCL1 | SPARC-like 1 (hevin) |
| 2 | ENSG00000165995 | CACNB2 | calcium channel, voltage-dependent, beta 2 subunit |
| 3 | ENSG00000101347 | SAMHD1 | SAM domain and HD domain 1 |
| 4 | ENSG00000120129 | DUSP1 | dual specificity phosphatase 1 |
| 5 | ENSG00000189221 | MAOA | monoamine oxidase A |
| 6 | ENSG00000211445 | GPX3 | glutathione peroxidase 3 |
| 7 | ENSG00000157214 | STEAP2 | STEAP family member 2, metalloreductase |
| 8 | ENSG00000162614 | NEXN | nexilin (F actin binding protein) |
| 9 | ENSG00000125148 | MT2A | metallothionein 2A |
| 10 | ENSG00000154734 | ADAMTS1 | ADAM metallopeptidase with thrombospondin type 1 motif, 1 |
| 11 | ENSG00000139132 | FGD4 | FYVE, RhoGEF and PH domain containing 4 |
| 12 | ENSG00000162493 | PDPN | podoplanin |
| 13 | ENSG00000162692 | VCAM1 | vascular cell adhesion molecule 1 |
| 14 | ENSG00000179094 | PER1 | period circadian clock 1 |
| 16 | ENSG00000134243 | SORT1 | sortilin 1 |
| 17 | ENSG00000163884 | KLF15 | Kruppel-like factor 15 |
| 18 | ENSG00000178695 | KCTD12 | potassium channel tetramerization domain containing 12 |
| 19 | ENSG00000146250 | PRSS35 | protease, serine, 35 |
| 20 | ENSG00000198624 | CCDC69 | coiled-coil domain containing 69 |
| 21 | ENSG00000148848 | ADAM12 | ADAM metallopeptidase domain 12 |
We can also re-analyze the data (i.e., build diff results tables) by changing the design as shown above or by specifying the contrast' argument of theresults` function, as shown below to look at the diff between two cell lines.
results(dds, contrast = c("cell", "N61311", "N052611"))
## log2 fold change (MAP): cell N61311 vs N052611
## Wald test p-value: cell N61311 vs N052611
## DataFrame with 64102 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 708.60217 -0.20680752 0.1368030 -1.5117175 0.1306057
## ENSG00000000005 0.00000 NA NA NA NA
## ENSG00000000419 520.29790 -0.06389491 0.1490723 -0.4286169 0.6682020
## ENSG00000000457 237.16304 0.06428564 0.1827151 0.3518355 0.7249617
## ENSG00000000460 57.93263 0.34319847 0.2840573 1.2082017 0.2269697
## ... ... ... ... ... ...
## LRG_94 0 NA NA NA NA
## LRG_96 0 NA NA NA NA
## LRG_97 0 NA NA NA NA
## LRG_98 0 NA NA NA NA
## LRG_99 0 NA NA NA NA
## padj
## <numeric>
## ENSG00000000003 0.4342936
## ENSG00000000005 NA
## ENSG00000000419 0.8907923
## ENSG00000000457 0.9138413
## ENSG00000000460 0.5752896
## ... ...
## LRG_94 NA
## LRG_96 NA
## LRG_97 NA
## LRG_98 NA
## LRG_99 NA
results(dds, contrast = c("dex", "trt", "untrt"))
## log2 fold change (MAP): dex trt vs untrt
## Wald test p-value: dex trt vs untrt
## DataFrame with 64102 rows and 6 columns
## baseMean log2FoldChange lfcSE stat
## <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 708.60217 -0.37424998 0.09873107 -3.7906000
## ENSG00000000005 0.00000 NA NA NA
## ENSG00000000419 520.29790 0.20215551 0.10929899 1.8495642
## ENSG00000000457 237.16304 0.03624826 0.13684258 0.2648902
## ENSG00000000460 57.93263 -0.08523371 0.24654402 -0.3457140
## ... ... ... ... ...
## LRG_94 0 NA NA NA
## LRG_96 0 NA NA NA
## LRG_97 0 NA NA NA
## LRG_98 0 NA NA NA
## LRG_99 0 NA NA NA
## pvalue padj
## <numeric> <numeric>
## ENSG00000000003 0.0001502838 0.001217611
## ENSG00000000005 NA NA
## ENSG00000000419 0.0643763851 0.188306353
## ENSG00000000457 0.7910940556 0.907203245
## ENSG00000000460 0.7295576905 0.874422374
## ... ... ...
## LRG_94 NA NA
## LRG_96 NA NA
## LRG_97 NA NA
## LRG_98 NA NA
## LRG_99 NA NA
results(dds)
## log2 fold change (MAP): dex trt vs untrt
## Wald test p-value: dex trt vs untrt
## DataFrame with 64102 rows and 6 columns
## baseMean log2FoldChange lfcSE stat
## <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 708.60217 -0.37424998 0.09873107 -3.7906000
## ENSG00000000005 0.00000 NA NA NA
## ENSG00000000419 520.29790 0.20215551 0.10929899 1.8495642
## ENSG00000000457 237.16304 0.03624826 0.13684258 0.2648902
## ENSG00000000460 57.93263 -0.08523371 0.24654402 -0.3457140
## ... ... ... ... ...
## LRG_94 0 NA NA NA
## LRG_96 0 NA NA NA
## LRG_97 0 NA NA NA
## LRG_98 0 NA NA NA
## LRG_99 0 NA NA NA
## pvalue padj
## <numeric> <numeric>
## ENSG00000000003 0.0001502838 0.001217611
## ENSG00000000005 NA NA
## ENSG00000000419 0.0643763851 0.188306353
## ENSG00000000457 0.7910940556 0.907203245
## ENSG00000000460 0.7295576905 0.874422374
## ... ... ...
## LRG_94 NA NA
## LRG_96 NA NA
## LRG_97 NA NA
## LRG_98 NA NA
## LRG_99 NA NA
In addition, we can use surrogate variable analysis (SVA) to investigate hidden structure due to, e.g., batch effects or other surrogate variables.
Because, in our example, we know that our dataset includes data from two diff cell lines, we could just use the DESeq function with a ~ cell + dex model design to look for dex/treatment-specific differences controlling for the cell line of origin. But suppose we were given this data without the cell line information. We could use SVA to try to identify this hidden structure.
library(sva)
## Loading required package: mgcv
## Loading required package: nlme
##
## Attaching package: 'nlme'
##
## The following object is masked from 'package:Biostrings':
##
## collapse
##
## The following object is masked from 'package:IRanges':
##
## collapse
##
## The following object is masked from 'package:dplyr':
##
## collapse
##
## This is mgcv 1.8-7. For overview type 'help("mgcv-package")'.
## Loading required package: genefilter
##
## Attaching package: 'genefilter'
##
## The following object is masked from 'package:base':
##
## anyNA
idx <- rowMeans(counts(dds)) > 0
dat <- counts(dds)[idx, ]
mod <- model.matrix(~ dex, colData(dds)) # full design
mod0 <- model.matrix(~ 1, colData(dds)) # reduced design
svseq <- svaseq(dat, mod, mod0, n.sv = 2)
## Number of significant surrogate variables is: 2
## Iteration (out of 5 ):1 2 3 4 5
plot(svseq$sv[, 1], svseq$sv[, 2], col = dds$cell, pch = 16)
legend("bottomright", levels(dds$cell), pch=16, col=1:4)
dds.sva <- dds
dds.sva$SV1 <- svseq$sv[, 1]
dds.sva$SV2 <- svseq$sv[, 2]
design(dds.sva) <- ~ SV1 + SV2 + dex
dds.sva <- DESeq(dds.sva)
## using pre-existing size factors
## estimating dispersions
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
The DEXSeq package offers differential testing of exon usage within each gene. We omit the python script calls for preparing the GFF file (from the GTF file) and the count tables (from the SAM/BAM files), but these can be found in the vignette along with additional information regarding the workflow outlined below.
library(pasilla)
dir <- system.file("extdata", package = "pasilla", mustWork = TRUE)
count_tables <- list.files(dir, pattern = "fb.txt$", full.names = TRUE)
gff_file <- list.files(dir, pattern = ".gff$", full.names = TRUE)
sample_table = data.frame(
row.names = c( "treated1", "treated2", "treated3",
"untreated1", "untreated2", "untreated3", "untreated4" ),
condition = c("knockdown", "knockdown", "knockdown",
"control", "control", "control", "control" ),
libType = c( "single-end", "paired-end", "paired-end",
"single-end", "single-end", "paired-end", "paired-end" ) )
print(sample_table)
## condition libType
## treated1 knockdown single-end
## treated2 knockdown paired-end
## treated3 knockdown paired-end
## untreated1 control single-end
## untreated2 control single-end
## untreated3 control paired-end
## untreated4 control paired-end
library(DEXSeq)
## Loading required package: BiocParallel
##
## Attaching package: 'DEXSeq'
##
## The following object is masked from 'package:Rsubread':
##
## featureCounts
dxd = DEXSeqDataSetFromHTSeq(count_tables,
sampleData = sample_table,
design = ~ sample + exon + condition:exon,
flattenedfile = gff_file )
## converting counts to integer mode
# Subset to a much reduced number of genes (just for demontration, typically you would just run the analysis on the whole dataset)
geneids_in_subset <- read.table(file.path(dir, "geneIDsinsubset.txt"), as.is = TRUE)[[1]]
head(geneids_in_subset)
## [1] "FBgn0250871" "FBgn0015278" "FBgn0026777" "FBgn0051021" "FBgn0032979"
## [6] "FBgn0039914"
dxd <- dxd[geneIDs(dxd) %in% geneids_in_subset, ]
dxd <- estimateSizeFactors(dxd)
dxd <- estimateDispersions(dxd)
## using supplied model matrix
## using supplied model matrix
dxd <- testForDEU(dxd)
## using supplied model matrix
dxd <- estimateExonFoldChanges(dxd, fitExpToVar = "condition")
dxr <- DEXSeqResults(dxd)
plotMA(dxr, cex = 0.8)
plotDEXSeq(dxr, "FBgn0010909", legend = TRUE, cex.axis = 1.2, cex = 1.3, lwd = 2)
plotDEXSeq(dxr, "FBgn0010909", displayTranscripts = TRUE, legend = TRUE, cex.axis = 1.2, cex = 1.3, lwd = 2)
Here we show the exploratory plots offered by the cummeRbund package, which provides an R interface to Cufflinks output files. Additional information regarding the workflow outlined below can be found in the vignette.
First, we load in a directory containing the results from a Cufflinks analysis.
library(cummeRbund)
## Loading required package: reshape2
## Loading required package: fastcluster
##
## Attaching package: 'fastcluster'
##
## The following object is masked from 'package:stats':
##
## hclust
##
## Loading required package: rtracklayer
## Loading required package: Gviz
## Loading required package: grid
##
## Attaching package: 'cummeRbund'
##
## The following object is masked from 'package:DESeq2':
##
## fpkm
##
## The following objects are masked from 'package:GenomicFeatures':
##
## features, genes, promoters
##
## The following object is masked from 'package:Biobase':
##
## samples
##
## The following object is masked from 'package:GenomicRanges':
##
## promoters
##
## The following object is masked from 'package:IRanges':
##
## promoters
##
## The following object is masked from 'package:BiocGenerics':
##
## conditions
##
## The following object is masked from 'package:dplyr':
##
## count
dir <- system.file("extdata", package = "cummeRbund")
gtf_file <- system.file("extdata/chr1_snippet.gtf", package = "cummeRbund")
cuff <- readCufflinks(dir, gtfFile = gtf_file, genome = "hg19", rebuild = TRUE)
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: attributes are not identical across measure variables; they will
## be dropped
Boxplots of expression (FPKM) at the gene and isoform level:
csBoxplot(genes(cuff)) + theme_bw()
csBoxplot(genes(cuff), replicates = TRUE) + theme_bw()
csBoxplot(isoforms(cuff), replicates = TRUE) + theme_bw()
Scatterplot matrix of gene and isoform level expression:
csScatterMatrix(genes(cuff)) + theme_bw()
csScatterMatrix(isoforms(cuff)) + theme_bw()
Sample dendrograms using Jensen-Shannon distances:
csDendro(genes(cuff), replicates = TRUE) + theme_bw()
## NULL
csDendro(isoforms(cuff), replicates = TRUE) + theme_bw()
## NULL
MA-plot comparing two conditions:
MAplot(genes(cuff), "hESC", "Fibroblasts") + theme_bw()
## Warning: Removed 54 rows containing missing values (geom_point).
MAplot(isoforms(cuff), "hESC", "Fibroblasts") + theme_bw()
## Warning: Removed 187 rows containing missing values (geom_point).
A “volcano plot” matrix. Each volcano plot is the -log10(p-value) over the log fold change.
csVolcanoMatrix(genes(cuff)) + theme_bw()
csVolcanoMatrix(isoforms(cuff)) + theme_bw()
Other software that performs isoform-level abundance estimation:
https://support.bioconductor.org/p/69643
For 3d pca plot:
y <- cpm(dge, log=TRUE, prior.count=3)
pc <- princomp(y)
library(rgl)
plot3d(pc$scores[,1:3])
Filtering out consistently non-expressed probe-sets by far the most common filtering step, because keeping probe-sets in your analysis that are never expressed is hardly ever useful. Apart from that, it is better not to filter unless you know what you’re doing. If you plan to use limma for the differential expression analysis, then filtering is not much needed, especially if you use trend=TRUE in the eBayes step. I personally prefer to keep it simple. Do some some simple filtering on mean log-expression, or don’t filter at all.
https://support.bioconductor.org/p/69467
There are many things you can do to increase power to detect DE genes in limma. Two things that will probably make a big difference are:
Filter out probes that aren’t expressed in your experiment, of which there are likely to be many.
Turn on robust eBayes estimation, because the rma() algorithm is returning identical normalized values for some of the probe sets.
For example,
design <- model.matrix(~factor(c(1,2,1,2)))
fit <- lmFit(eset, design)
keep <- fit$Amean > median(fit$Amean)
fit <- eBayes(fit[keep,], robust=TRUE, trend=TRUE)
topTable(fit, coef=2)
Beware that the variance filtering methods recommended by the article linked to must only be used with ordinary t-tests. They should not be used with limma. In a limma pipeline, filtering by mean log-expression is simpler and better.
he Bourgon et al article itself says that variance filtering is not appropriate with limma, but unfortunately not very prominently so many people miss it. Personally, I think that variance filtering is problematic in many or most pipelines, not just with limma. I don’t think it would be a good idea with any expression platform that produces a decreasing mean-variance relationship, as most do. Nor is it appropropriate with any of the leading microarray statistical tests (limma, SAM, maanova, CyberT etc). To me, the Affymetrix-rma-ttest pipeline is one of the very few where variance filtering is beneficial, because it unusually produces an increasing mean-variance relationship at low intensities. Yet even here I would hope to get more benefit out of limma instead.
As far as what one should do, most people agree that filtering low expressed or non-expressed probes is a good idea – it makes intuitive sense from both statistical and biological points of view. I must admit I never thought that a journal article was necessary on this, but maybe I was wrong.
Don’t do variance filtering – then the problems you mention will all disappear. Variance filtering is hardly ever a good idea, and almost certainly not with RNA-seq data.
The limma algorithm analyses the spread of the genewise variances. Any filtering method based on genewise variances will change the distribution of variances, will interfere with the limma algorithm and hence will give poor results.
Like most people, I recommend filtering out genes that don’t appear to be expressed in any sample. See for example Case studies 15.3 or 15.4 in the limma User’s Guide.
However you will find if you use eBayes(fit,trend=TRUE) instead of the usual eBayes(fit) that limma gives pretty good results regardless how much filtering you do, provided of course that the filtering is on expression and not on variance.
The literature tends to say that the reason for filtering is to reduce the amount of multiple testing, but in truth the increase in power from this is only slight. The more important reason for filtering in most applications is to remove highly variable genes at low intensities. The importance of filtering is highly dependent on how you pre-processed your data. Filtering is less important if you (i) use a good background correction or normalising method that damps down variability at low intensities and (ii) use eBayes(trend=TRUE) which accommodates a mean-variance trend.
http://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf http://bioinf.wehi.edu.au/RNAseqCaseStudy/
Dear Gordon
The literature tends to say that the reason for filtering is to reduce the amount of multiple testing, but in truth the increase in power from this is only slight. The more important reason for filtering in most applications is to remove highly variable genes at low intensities. The importance of filtering is highly dependent on how you pre-processed your data. Filtering is less important if you (i) use a good background correction or normalising method that damps down variability at low intensities and (ii) use eBayes(trend=TRUE) which accommodates a mean-variance trend.
With all respect, I think this paragraph mixes up two separate issues and can benefit from clarification.
While literature can probably be found to support any statement, the above-cited reason is indeed bogus when multiple testing is performed with an FDR objective. The paper by Bourgon et al. motivates filtering differently, namely by using a filter criterion that is independent of the test statistic under the null (thus does not affect type-I error; some subtlety is discussed in that paper) but dependent under the alternative (thus improves power).
“Highly variable genes at low intensities” are indeed a problem of bad preprocessing and are better dealt with at that level, not by filtering. Nowadays, the commonly used methods for expression microarray or RNA-Seq analysis that I am aware of avoid that problem.
The question when & how independent filtering (as in 1) is beneficial is quite unrelated to preprocessing. You are right that FDR is a property of the whole selected gene list, not of individual genes, and that different approaches exist for spending the type-I error budget wisely, by weighting different genes differently; of which independent filtering is one and trended eBayes (which is not the default option in limma) may be another.
Reference: Bourgon et al. PNAS 2010: http://www.pnas.org/content/107/21/9546
There are methods in the genefilter package that can be used to filter data, and you can also remove probesets based on present/absent calls (http://www.bioconductor.org/packages/release/bioc/html/panp.html).
library(oligo) library(limma) library(genefilter)
http://www.gettinggeneticsdone.com/2014/05/r-volcano-plots-to-visualize-rnaseq-microarray.html